Add timeout to gRPC workitem streaming #390
Conversation
This commit adds a timeout to the gRPC stream used to communicate with the backend. This was done because the backend could restart and drop the connection without the worker knowing, leaving the worker hung and never receiving new work items. The fix is to reset the connection if too long a period passes without receiving anything on the stream. Signed-off-by: halspang <[email protected]>
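As a rough sketch of that mechanism (this is not the SDK's actual code; the generic `TWorkItem`, the `connect`/`process` delegates, and `workItemTimeout` are stand-ins for the real protobuf types, gRPC client call, and configured value), a per-connection `CancellationTokenSource` is linked to the worker's shutdown token, armed with `CancelAfter`, and refreshed whenever anything arrives on the stream:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Core;

static class WorkItemStreamSketch
{
    // Sketch only: one token source per connection, cancelled either by worker
    // shutdown or by the idle timeout; the outer loop re-establishes the stream.
    public static async Task ListenAsync<TWorkItem>(
        Func<AsyncServerStreamingCall<TWorkItem>> connect,
        Func<TWorkItem, Task> process,
        TimeSpan workItemTimeout,
        CancellationToken shutdownToken)
    {
        while (!shutdownToken.IsCancellationRequested)
        {
            using var tokenSource = CancellationTokenSource.CreateLinkedTokenSource(shutdownToken);
            tokenSource.CancelAfter(workItemTimeout); // arm the idle timer

            using AsyncServerStreamingCall<TWorkItem> stream = connect();
            await foreach (TWorkItem workItem in stream.ResponseStream.ReadAllAsync(tokenSource.Token))
            {
                tokenSource.CancelAfter(workItemTimeout); // anything received resets the timer
                await process(workItem);
            }

            // When the token is cancelled, the foreach ends without throwing (see the
            // review discussion below); if the worker is not shutting down, the next
            // iteration opens a fresh connection to the backend.
        }
    }
}
```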
Force-pushed from 14951b4 to a098203
      while (!cancellation.IsCancellationRequested)
      {
-         await foreach (P.WorkItem workItem in stream.ResponseStream.ReadAllAsync(cancellation))
+         await foreach (P.WorkItem workItem in stream.ResponseStream.ReadAllAsync(tokenSource.Token))
Would this not throw if the connection is closed?
We thought it would too, but if you go into the IAsyncStreamReader, a cancellation is actually just treated as the end of the stream and it returns normally.
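For context, `ReadAllAsync` is essentially a thin wrapper over `IAsyncStreamReader<T>.MoveNext`, roughly like the sketch below (simplified, not the library's exact source). Per the comment above and the linked docs, a cancelled token makes the read report end-of-stream rather than throwing, so the consuming `await foreach` simply ends:

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using Grpc.Core;

static class ReadAllSketch
{
    // Simplified sketch of what the ReadAllAsync extension amounts to.
    public static async IAsyncEnumerable<T> ReadAll<T>(
        IAsyncStreamReader<T> reader,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        // Per the discussion above, this loop exits both when the server closes the
        // stream and when the token is cancelled; the caller sees no exception and
        // cannot tell the two apart without checking the token afterwards.
        while (await reader.MoveNext(cancellationToken).ConfigureAwait(false))
        {
            yield return reader.Current;
        }
    }
}
```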
          }
      }

      if (tokenSource.IsCancellationRequested || tokenSource.Token.IsCancellationRequested)
Maybe I am missing something, but it seems unlikely this line would ever be true. If IsCancellationRequested, then more likely than not stream.ResponseStream.ReadAllAsync would throw an OperationCanceledException.
See above. We thought this behavior was an odd choice for the stream reader as well, but it's documented that it doesn't throw.
Can you explain the observed code flow when the scheduler shuts down? Is an exception thrown? If so, I would expect this line to be hit:
It doesn't throw an exception; it returns as if the stream had just ended normally. So, once the cancellation is triggered, the foreach loop exits and we check the token. If the token is cancelled, we return. If the overall cancellation was also requested, it exits at that level as well; if not, it creates a new connection to the scheduler. https://grpc.github.io/grpc/csharp/api/Grpc.Core.IAsyncStreamReader-1.html
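Mapping that description onto code, the decision after the loop ends looks roughly like this (an illustrative sketch, not the actual implementation; the class, enum, and method names are made up):

```csharp
using System.Threading;

static class StreamExitSketch
{
    public enum NextStep { Shutdown, Reconnect }

    // After ReadAllAsync returns (no exception is thrown), the two tokens determine
    // what happens next: exit entirely on worker shutdown, otherwise reconnect,
    // whether the stream ended because the idle timeout fired or because the server
    // closed it.
    public static NextStep AfterStreamEnded(
        CancellationToken workerShutdown,
        CancellationTokenSource streamTokenSource)
    {
        if (workerShutdown.IsCancellationRequested)
        {
            return NextStep.Shutdown;
        }

        if (streamTokenSource.IsCancellationRequested)
        {
            // Idle timeout fired: warn and open a new connection.
            return NextStep.Reconnect;
        }

        // Server ended the stream normally; reconnect as well.
        return NextStep.Reconnect;
    }
}
```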
jviau left a comment:
Approving to unblock, but I think this is treating a symptom and not the underlying problem. If a scheduler restart isn't closing the connection to the worker, thus ending the stream, something else is keeping that alive (such as a reverse proxy). This needs to be looked at and addressed, as it is violating some fundamental expectations of gRPC streams.
I would feel slightly better if we made this privately configurable by the AzureManaged package somehow. We already have some pseudo-internal options they use here; we could add to that and have this behavior only enabled for DTS.
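Something like the following is what that could look like (purely hypothetical; the options class and property name here are illustrative, not existing SDK surface):

```csharp
using System;

// Hypothetical pseudo-internal knob, set only by the Azure-managed (DTS) package.
public sealed class GrpcWorkerStreamOptions
{
    /// <summary>
    /// When set, the work-item stream is reset if nothing is received for this long.
    /// Left null (disabled) for other gRPC backends.
    /// </summary>
    internal TimeSpan? WorkItemStreamIdleTimeout { get; set; }
}
```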
Agreed, but I think this is a safety mechanism we want anyway, regardless of which gRPC server implementation we're targeting, so I'm happy to go with this for now. Warning logs have been added so that we can observe this behavior and be reminded that it still needs to be root-caused.